Abstract. Given a string w over a finite alphabet Σ and an integer K,
can w be partitioned into strings of length at most K such that there are
no collisions? We refer to this question as the string partition problem
and show it is NP-complete for various definitions of collision and for a
number of interesting restrictions including |Σ| = 2. This establishes the
hardness of an important problem in contemporary synthetic biology,
namely, oligo design for gene synthesis.

1 Introduction 

Many problems in genomics have been solved by the application of elegant 
polynomial-time string algorithms, while others amount to solving known NP- 
complete problems; for instance, sequence assembly amounts to solving shortest 
common superstring [11], and genome rearrangement to sorting strings by rever- 
sals and transpositions [2]. The hardness of these problems has motivated ex- 
tensive research into heuristic algorithms as well as polynomial-time algorithms 
for useful restrictions [6,10,19,9,8,14,16]. In a similar vein, we establish the 
hardness of the following fundamental question: can a string be partitioned into 
factors (i.e. substrings), of bounded length, such that no two collide? We refer 
to this as the string partition problem and study it under various restrictions 
and definitions of what it means for two factors to collide. 

The study of string partitioning is motivated by an increasingly important 
problem arising in contemporary synthetic biology, namely gene synthesis. This 
technology is emerging as an important tool for a number of purposes including 
the determination of RNAi targeting specificity of a particular gene [12], design of 
novel proteins [5] and the construction of complete bacterial genomes [7] . There 
have been numerous studies utilizing synthetic genes to determine the potential 
of gene vaccines [13,3,17,1]. Despite the tremendous need for synthetic genes 
for both interrogative studies and for therapeutics, construction of genes, or any 
long DNA or RNA sequence, is not a trivial matter. Current technology can only 
produce short oligonucleotides (oligos) accurately. As such, a common approach 
is to design a set of oligos that could assemble into the desired sequence [18]. 

To understand the connection between string partitioning and gene synthesis, 
consider the following. A DNA oligo^ or strand is a string over the four letter 
alphabet {A, C,G, T}. The reverse complement F' of an oligo F is determined 
from F by replacing each A with a T and vice versa, each C with a G and 



vice versa, and reversing the resulting string. Two DNA oligos F and are 
said to hybridize if a sufficiently long factor of F is the reverse complement of 
a factor of F' (see Figure 1). A DNA duplex consists of a positive strand and 
its reverse complement, the negative strand. The collision- aware oligo design for 
gene synthesis (CA-ODGS) problem is to determine cut points in the positive 
and negative strands, which demarcate the oligos to be synthesized, such that 
the resulting design will successfully self- assemble. For the oligos to self-assemble 
correctly, they should 1) alternate between the positive and negative strands, 
with some overlap between successive oligos, and 2) only hybridize to the oligos 
they overlap with by design. Since there is variability in the length of the selected 
oligos, there are exponentially many designs. 
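The reverse complement and the hybridization test described above can be sketched in a few lines of Python. This is only an illustration: the function names and the threshold parameter `min_len` (standing in for "sufficiently long") are assumptions, not part of the paper.

```python
# Complement table for the four-letter DNA alphabet.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(oligo: str) -> str:
    """Swap A<->T and C<->G, then reverse the string."""
    return oligo.translate(COMPLEMENT)[::-1]


def hybridize(f: str, g: str, min_len: int) -> bool:
    """True if some factor of f of length min_len has its reverse
    complement occurring as a factor of g.

    min_len is a hypothetical stand-in for "sufficiently long".
    """
    return any(
        reverse_complement(f[i:i + min_len]) in g
        for i in range(len(f) - min_len + 1)
    )
```

Checking only factors of length exactly `min_len` is enough, since any longer matching factor contains a matching factor of that length.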




Fig. 1. An intended self-assembly (top) of a set of oligos for a desired DNA duplex. A 
foiled self-assembly (bottom) of the same oligos due to d and h being identical. 



In previous work [4], the authors provided some evidence that the CA-ODGS 
problem may be hard by showing that partitioning a string into factors, of 
bounded length, such that no two are equal is NP-complete, even for strings 
over a quaternary alphabet. See Figure 1 for an example design that assembles 
incorrectly into two fragments, with the wrong ordering of oligos and therefore 
primary sequence, due to identical oligos. In this work, we study the underly- 
ing string partition problem in much greater detail. We show that partitioning 
strings such that no selected string is a copy/factor/prefix/suffix of another is
NP-complete. We begin by showing that the more general problem of partition- 
ing a set of strings is hard and then we show how those instances can be reduced 
to single string instances, for each respective definition of collision. See Figure 2 
for an example of a single string instance (left) and set of strings instance (right). 
In all cases, we demonstrate the problems remain hard even when restricted to 
binary strings. 




Fig. 2. (Left) Two partitions are shown for the string mississippi. The selected strings 
in both partitions have maximum length 2. The partition shown above the string is 
factor-free: no selected string is a factor of another; however, the partition shown below 
the string is not factor-free. (Right) A valid factor-free multiple string partition of a 
set of three strings into selected strings of maximum length 3. 

2 Preliminaries 

A string w is a sequence of letters over an alphabet Σ. Let |w| denote the length
of w, let w^R denote the mirror image (reversal) of w, and let (w)^i denote the
string w repeated i times. The empty string is denoted as ε. String x is a factor
of w if w = αxβ for some (possibly empty) strings α and β. Similarly, x is a
prefix (suffix) of w if w = xβ (w = αx) for some (possibly empty) strings α and β.
The prefix (suffix) of length k of w will be denoted as prefix_k(w) (suffix_k(w)).

A K-partition of w is a sequence P = p_1, p_2, ..., p_l, for some l, where each p_i
is a string over Σ of length at most K and w = p_1 p_2 ... p_l. We say that strings
p_1, ..., p_l are selected in the K-partition and that strings p_i ... p_j, 1 ≤ i < j ≤ l,
are super-selected, with respect to the selected strings. We say P is equality-free,
prefix-free, suffix-free, or factor-free if for all i, j with 1 ≤ i < j ≤ l, neither
p_i nor p_j is a copy, prefix, suffix, or factor, respectively, of the other. We say
such partitions are valid (for the problem in question); otherwise, we say the
partition contains a collision. We generalize the notion of a K-partition to a
set of strings W to mean a K-partition for each string in W. The length of W
is the combined length of the strings in the set and will be denoted by ||W||.
A K-partition for a set of strings is valid if no two elements in any, possibly
different, partition collide. Finally, we will refer to the boundaries of a partition
of string w as cut points, where the first cut point 0 and the last cut point |w|
are called trivial. For instance, the first partition of mississippi in Figure 2 has
the following non-trivial cut points: 1, 3, 5, 7 and 9.
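The definitions above translate directly into a small validity checker. The sketch below is illustrative only: the names `split_at`, `is_valid` and `COLLIDES` are hypothetical, and it assumes that selected strings are non-empty.

```python
from itertools import combinations

# One collision predicate per definition of "collide".
COLLIDES = {
    "equality": lambda a, b: a == b,
    "prefix": lambda a, b: a.startswith(b) or b.startswith(a),
    "suffix": lambda a, b: a.endswith(b) or b.endswith(a),
    "factor": lambda a, b: a in b or b in a,
}


def split_at(w, cuts):
    """Split w at the given non-trivial cut points (sorted, strictly inside w)."""
    bounds = [0, *cuts, len(w)]
    return [w[i:j] for i, j in zip(bounds, bounds[1:])]


def is_valid(parts, K, kind):
    """True if every selected string has length 1..K and no two collide.

    For a set of strings, pass the selected strings of all partitions together.
    """
    if any(not 1 <= len(p) <= K for p in parts):
        return False
    return not any(COLLIDES[kind](a, b) for a, b in combinations(parts, 2))
```

With the cut points 1, 3, 5, 7 and 9 quoted above, `split_at("mississippi", [1, 3, 5, 7, 9])` yields m, is, si, ss, ip, pi, which `is_valid(..., 2, "factor")` accepts.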

In what follows we will prove NP-completeness of various string partition- 
ing problems by showing a polynomial reduction from an arbitrary instance of 
3SAT(3), a problem shown to be NP-complete by Papadimitriou [15]. 

Problem 1 (3SAT(3)). 

Instance: A formula φ with a set C of clauses over a set X of variables in
conjunctive normal form such that:

1. every clause contains two or three literals,

2. each variable occurs in exactly three clauses, once negated and twice positive.

Question: Is φ satisfiable?
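The two structural conditions of 3SAT(3) are easy to check mechanically. The following sketch assumes a DIMACS-style encoding (an assumption made here for illustration, not taken from the paper): each clause is a list of non-zero integers, +v for variable v and -v for its negation.

```python
from collections import Counter


def is_3sat3_instance(clauses):
    """Check the two structural conditions of Problem 1.

    clauses: list of clauses, each a list of non-zero ints
    (+v for variable v, -v for its negation; hypothetical encoding).
    """
    # Condition 1: every clause contains two or three literals.
    if any(len(clause) not in (2, 3) for clause in clauses):
        return False
    pos, neg = Counter(), Counter()
    for clause in clauses:
        # A variable must occur in three distinct clauses,
        # so it may not appear twice within one clause.
        if len({abs(lit) for lit in clause}) != len(clause):
            return False
        for lit in clause:
            (pos if lit > 0 else neg)[abs(lit)] += 1
    # Condition 2: each variable occurs twice positive and once negated.
    variables = set(pos) | set(neg)
    return all(pos[v] == 2 and neg[v] == 1 for v in variables)
```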



3 The String Partition Problems 



[Figure 3: diagram of reductions starting from 3SAT(3) and passing through the
problems EF-MSP(K=2), EF-SP(K=2), EF-MSP(L=2), EF-SP(L=2), FF-MSP(K=3),
FF-SP(K=3), FF-MSP(L=2), FF-SP(L=2), PF-MSP(K=2), PF-SP(K=2), PF-MSP(L=2)
and PF-SP(L=2).]

Fig. 3. Chain of reductions for different string partition variations from the original
3SAT(3) problem. K is the maximum selected string size and L is the maximum alphabet
size. Parameters are unbounded if not shown. EF, FF and PF are equality-free,
factor-free, and prefix (suffix)-free, respectively.

For each X in {equality, prefix, suffix, factor}, we will consider two string 
partition problems. 

Problem 2 (X-Free Multiple String Partition (X-MSP) Problem).

Instance: Finite alphabet Σ of size L, a positive integer K, and a set of strings
W over Σ*.

Question: Is there an X-free K-partition P of W?

Problem 3 (X-Free String Partition (X-SP) Problem).

Instance: Finite alphabet Σ of size L, a positive integer K, and a string w over
Σ*.

Question: Is there an X-free K-partition P of w?

We will show NP-completeness of all these problems even when restricted to
a constant maximum partition size (K = 2, 3), or to the binary alphabet (L = 2).
See Figure 3, which shows the chain of reductions used to prove the complexity of
the three variations and related restrictions of the problem.

4 Equality-Free String Partition Problems 

4.1 Equality-Free Multiple String Partition with Unbounded
Alphabet

We now describe a polynomial reduction from 3SAT(3) to EF-MSP with K =
2 and unbounded alphabet. Let φ be an instance of 3SAT(3), with set C =
{c_1, ..., c_m} of clauses, and set X = {x_1, ..., x_n} of variables. We shall define
an alphabet Σ and construct a set of strings W over Σ such that W has a
collision-free 2-partition if and only if φ is satisfiable. Let |c_i| denote the number
of literals contained in the clause c_i and let c_i^1, ..., c_i^{|c_i|} be the literals of
clause c_i.
We construct W to be a union of three types of strings: clause strings (𝒞),
enforcer strings (ℰ) and forbidden strings (ℱ). First, for each clause of φ, we
create a clause string such that an equality-free 2-partition of 𝒞 unambiguously
selects exactly one literal from it. We refer to the selected strings corresponding
to literals as selected literals. Intuitively, the selected literals of the clause strings
are intended to be a satisfying truth assignment for the variables of φ. Second,
for each variable we create an enforcer string to ensure that selected literals are
consistent. Specifically, the enforcer strings ensure that a positive and a negative
literal for the same variable cannot be simultaneously selected. Finally, we find
it helpful to create so-called forbidden strings that ensure certain strings cannot
be selected in the clause and enforcer strings.

We construct an alphabet Σ, formally defined below, which includes a letter
for each literal occurrence in the clauses, one letter for each variable, and the
letters B and ⊞ used as delimiters.

Σ = {x_i : x_i ∈ X} ∪ {c_i^j : c_i ∈ C ∧ 1 ≤ j ≤ |c_i|} ∪ {B, ⊞}

Note that |Σ| is linear in the size of the 3SAT(3) problem φ (at most n +
3m + 2).

Construction of forbidden strings: To ensure that certain strings cannot be
selected in 𝒞 or ℰ, we will use the following set of forbidden strings ℱ = {B, ⊞}.

Observation 1 No string from the forbidden set ℱ can be selected in 𝒞 or ℰ.

Construction of clause strings: For each clause c_i ∈ C, construct the i-th clause
string to be c_i^1 B c_i^2 if |c_i| = 2, and c_i^1 B c_i^2 B c_i^3 if |c_i| = 3.


Fig. 4. The 2-literal clause string (left) and 3-literal clause string (right) used in the 
reduction from 3SAT(3) to EF-MSP. Shown below each string are all valid 2-partitions. 
Selected literals of a partition are shown in red. 

Lemma 1. Given that no string from the forbidden set ℱ is selected in 𝒞, exactly
one literal letter must be selected for each clause string in any equality-free
2-partition of 𝒞.

Proof. Consider the clause string for clause c_i. Whether c_i has two or three
literals, the forbidden string B cannot be selected alone. Therefore, each B
must be selected with an adjacent literal letter. This leaves exactly one other
literal letter which must be selected (see Figure 4). □
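Lemma 1 can be confirmed by brute force on the two clause-string shapes. In the sketch below a string is a list of letter tokens; the token names "c1", "c2", "c3" and "B" are placeholders for the literal letters and the delimiter, and all function names are hypothetical.

```python
def partitions(letters, K):
    """Yield every K-partition of a sequence of letters as a list of tuples."""
    if not letters:
        yield []
        return
    for k in range(1, min(K, len(letters)) + 1):
        for rest in partitions(letters[k:], K):
            yield [tuple(letters[:k])] + rest


def valid_clause_partitions(clause_string, forbidden=(("B",),)):
    """Equality-free 2-partitions that select no forbidden string."""
    for p in partitions(clause_string, 2):
        if len(set(p)) == len(p) and not any(s in forbidden for s in p):
            yield p


def selected_literals(partition):
    """Letters selected on their own (length-one selected strings)."""
    return [s[0] for s in partition if len(s) == 1]


three = list(valid_clause_partitions(["c1", "B", "c2", "B", "c3"]))
two = list(valid_clause_partitions(["c1", "B", "c2"]))
```

Under these assumptions the three-literal clause string has three admissible 2-partitions and the two-literal one has two, and each selects exactly one literal letter on its own.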



Construction of enforcer strings: We must now ensure that no literal of φ that is
selected in 𝒞 is the negation of another selected literal. By definition of 3SAT(3),
each variable appears exactly three times: twice positive and once negated. Let
and be the two positive and the negated occurrences of a variable Xy. Then 
construct the enforcer string for this variable as follows ffl c^x^c^x^c^ ffl cj. 




Fig. 5. All possible 2-partitions are shown for the enforcer string of a variable x_v having
two positive literals and one negative literal. In each partition, either the letter for the
negative literal is selected or the letters for both positive literals are, which guarantees
that letters for positive and negated literals of x_v cannot be simultaneously selected in 𝒞.



Lemma 2. Given that no string from the forbidden set ℱ is selected in 𝒞 ∪ ℰ,
any equality-free 2-partition of 𝒞 ∪ ℰ must be consistent. In addition, for any
consistent choice of selecting letters for literals in 𝒞, there is an equality-free
2-partition of 𝒞 ∪ ℰ ∪ ℱ.

Proof. Consider the enforcer string for variable x_v with its two positive literals
(both equal to x_v) and its negated literal (equal to ¬x_v). Figure 5 shows all 9
possible 2-partitions of the enforcer string (since ⊞ is a forbidden string, each ⊞
must be selected with an adjacent letter). It follows that in each of them either
the letter for the negated literal is selected or the letters for both positive literals
are. In the first case, the letter for the negated literal cannot be selected in 𝒞 and
thus satisfied literals are chosen consistently for x_v. In the second case, letters
for neither of the positive occurrences of x_v can be selected in 𝒞.

To show the second part of the claim, observe that there is a 2-partition of
the enforcer string compatible with any of the four valid combinations of selecting
letters for the corresponding literals in 𝒞 (for example, by choosing the fifth or
the last 2-partition in Figure 5). Since enforcer strings share only one letter in
common, namely, ⊞, which is never selected in the enforcer strings, there are no
collisions between 2-partitions of all enforcer strings. Furthermore, there are no
collisions between strings selected in 𝒞 and in ℰ: strings of length two selected
in 𝒞 contain the letter B, which does not appear in the enforcer strings; strings
of length one are literals, and the partitioning of enforcer strings was chosen in
a way that literals selected in 𝒞 cannot be selected again in ℰ. □

This completes the reduction. Notice that the reduction is polynomial as the
combined length of the constructed set of strings W = 𝒞 ∪ ℰ ∪ ℱ is at most
5m + 9n + 2.



Theorem 1. Equality-Free Multiple String Partition (EF-MSP) is NP-complete
for K = 2.

Proof. It is easy to see that the EF-MSP Problem is in NP: a nondeterministic
algorithm need only guess a partition P where |p_i| ≤ K for all p_i in P and
check in polynomial time that no two strings in P are equal. Furthermore, it is
clear that an arbitrary instance φ of 3SAT(3) can be reduced to an instance of
EF-MSP, specified by a set of strings W = 𝒞 ∪ ℰ ∪ ℱ, in polynomial time and
space by the reduction detailed above.

Now suppose there is a satisfying truth assignment for φ. Simply select one
corresponding true literal per clause in 𝒞. The construction of clause strings
guarantees that a 2-partition of the rest of each clause string is possible. Also,
since a satisfying truth assignment for φ cannot make two opposite literals true,
Lemma 2 guarantees that a valid partition of the enforcer strings is
possible which does not conflict with the clause strings. Therefore, there exists
an equality-free multiple string partition of W.

Likewise, consider an equality-free multiple string partition of W. Lemma 1
ensures that at least one literal per clause is selected. Furthermore, Lemma 2
guarantees that if there is no collision, then no two selected literals in the
clauses are negations of each other. Therefore, this must correspond to a
satisfying truth assignment for φ (if none of the three literals of a variable is
selected in the partition of 𝒞 then this variable can have an arbitrary value in
the truth assignment without affecting satisfiability of φ). □

4.2 Equality-Free String Partition with Unbounded Alphabet

Theorem 2. Equality-Free String Partition (EF-SP) is NP-complete for K =
2.

Proof. To show that the EF-SP Problem for K = 2 is NP-complete, we will reduce
the EF-MSP Problem for K = 2 to it. Consider an arbitrary instance I of EF-MSP
having a set of strings W = {w_1, w_2, ..., w_ℓ} over alphabet Σ, and maximum
partition size K = 2. We construct an instance Î of EF-SP as follows. Let
S = {□} ∪ {d_i : 1 ≤ i < ℓ}, where S ∩ Σ = ∅. Set the alphabet of Î to
Σ̂ = Σ ∪ S and the maximum partition size to K̂ = 2. Note that |Σ̂| = |Σ| + ℓ.
Finally, construct the string 

= □ □ □ □ Bwidi □ BdiW2d2 BBd2... di-i B Bdi-iwi . 

The prefix of w of length five can be partitioned in two different ways each 
selecting □. Consequently, in any 2-partition of remaining occurrences of 
□ must be selected together with an adjacent letter different from □, i.e., all 
strings diB and Bdi must be selected. Therefore, any 2-partition of w contains a 
2-partition of W and the strings V = {□, □□, □□, diB, Bdi, . . . , di-iB, Bd^-i}. 
On the other hand, since all strings in V contain □ ^ i7, any 2-partition of w 
together with V forms a 2-partition of W. It follows that there is a 2-partition 
of W if and only if there is a 2-partition of w. The reduction is in polynomial 
time and space as \w\ = ||W|| +4^+1. □ 



4.3 Equality-Free Multiple String Partition with Binary Alphabet 

Theorem 3. The EF-MSP with maximum partition size K = 2 can be polynomially
reduced to the EF-MSP Problem with the alphabet size L = 2. Consequently,
the EF-MSP is NP-complete for binary alphabet. In addition, this
reduction satisfies the following property: for any set C containing n distinct
strings of length δ, where n is the size of the alphabet of the EF-MSP with
maximum partition size K = 2 and δ ≥ log_2 n, every selected word in a valid
partition (if it exists) of the EF-MSP with the binary alphabet is a prefix of a
string in C^2, and its maximum partition size is K̂ = 2δ.

Proof. We will show a reduction from the EF-MSP with maximum partition
size K = 2. Consider an arbitrary instance I of EF-MSP having a set of strings
W = {w_1, w_2, ..., w_ℓ} over alphabet Σ = {a_1, ..., a_n}, and maximum partition
size K = 2. We will construct an instance Î of EF-MSP over binary alphabet
Σ̂ = {0, 1}. Let δ be any number greater than or equal to log_2 n. Let C = {c_1, ..., c_n}
be a set of any distinct binary codewords of length δ. We set K̂ to 2δ. Let h be
a homomorphism from Σ to C such that h(a_i) = c_i for every i = 1, ..., n. The
set of strings Ŵ of Î will contain h(W), i.e., the original strings in W mapped by
h to the binary alphabet Σ̂. However, we need to guarantee that the partition
of strings in h(W) does not contain fragments of codewords. For this reason, we
also add to Ŵ the following set of strings, which we denote W_C:

W_C = {prefix_i(c) : c ∈ C, i = 1, ..., δ - 1} ∪
      {prefix_i(cd) : c, d ∈ C, i = δ + 1, ..., 2δ - 1}

We set Ŵ = h(W) ∪ W_C.
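The construction of the binary instance can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the name W_C for the added prefix set, and the particular choice of codewords (the first n binary numbers padded to length δ) are all assumptions; any n distinct codewords of length δ would do.

```python
from math import ceil, log2


def encode_instance(W, alphabet, delta=None):
    """Map an EF-MSP instance with K = 2 to a binary one with K = 2*delta.

    W: list of strings, each given as a list of letters from `alphabet`.
    Returns (h_W, W_C, K_hat).
    """
    n = len(alphabet)
    if delta is None:
        delta = max(1, ceil(log2(n)))
    assert n <= 2 ** delta, "delta must be at least log2(n)"
    # Hypothetical codeword choice: 0, 1, ..., n-1 written with delta bits.
    codewords = [format(i, "b").zfill(delta) for i in range(n)]
    h = dict(zip(alphabet, codewords))
    # h(W): the original strings mapped letter by letter.
    h_W = ["".join(h[a] for a in w) for w in W]
    # Proper prefixes of single codewords (lengths 1 .. delta-1) ...
    W_C = {c[:i] for c in codewords for i in range(1, delta)}
    # ... and prefixes of codeword pairs (lengths delta+1 .. 2*delta-1).
    W_C |= {(c + d)[:i] for c in codewords for d in codewords
            for i in range(delta + 1, 2 * delta)}
    return h_W, sorted(W_C), 2 * delta


h_W, W_C, K_hat = encode_instance([["a", "b"], ["c", "a", "c"]],
                                  ["a", "b", "c"], delta=2)
```

No string in the prefix set has length δ or 2δ, which is what separates it from the images of the originally selected strings.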

First, consider a valid 2-partition P of W. We construct a K̂-partition P̂ of
the constructed set Ŵ as follows. For each string s selected in P, we select the
corresponding h(s) in P̂. For each string t in the added prefix set W_C, we select
t entirely. Note that strings selected from h(W) have length either δ or 2δ, while
strings selected from W_C have lengths different from δ and 2δ. Therefore, there
cannot be any collisions between these two groups of selected strings.
Furthermore, there are no collisions in the first group, since there were no
collisions in P. Obviously, there are no collisions in the second group of selected
strings. It follows that P̂ is a valid K̂-partition of Ŵ.

Conversely, consider a valid K̂-partition P̂ of Ŵ. First, we will show that all
strings in the added prefix set W_C are selected without non-trivial cut points.
We will prove that by induction on the length i of strings. The base case, i = 1,
is trivially true, as one-letter strings cannot be partitioned into shorter strings.
Now, assume the claim is true for all strings in W_C of lengths smaller than
i < 2δ and different from δ. Consider a word u ∈ W_C of length i. Assume that u
is partitioned into strings u_1, ..., u_t, where t ≥ 2. Note that the length of u_1
is smaller than i. If the length of u_1 is different from δ, we have a collision, as
u_1 ∈ W_C and by the induction hypothesis, it was selected without non-trivial
cut points. Assume that the length of u_1 is δ. Then u_2 is a prefix of a codeword
of length smaller than min{δ, i}, and we have a collision again as in the previous
case. It follows that t = 1, i.e., u is selected without non-trivial cut points in P̂.
Second, we show that all strings selected in the partition of strings in h(W) have
lengths either δ or 2δ. Assume that this is not the case for some string s ∈ h(W).
Note that s = c_{i_1} c_{i_2} ... c_{i_p} for some indices i_1, ..., i_p. Let s = s_1 ... s_q
be the partition of s and let j be the smallest index such that the length of s_j is
not δ or 2δ. Then s_1 ... s_{j-1} = c_{i_1} ... c_{i_r}, for some r < p. Consequently,
s_j is a prefix of c_{i_{r+1}} c_{i_{r+2}}, i.e., s_j ∈ W_C, and we have a collision, since
s_j was already selected in the partition of W_C. Hence, each string in h(W) is
partitioned into strings of lengths either δ or 2δ, which can be easily mapped to a
valid 2-partition of W.

It follows that there is a 2-partition of W if and only if there is a K̂-partition
of Ŵ and that the reduction satisfies the property described in the claim.

Finally, let us check that the reduction is polynomial. The size of h(W) is
|W| and the length of h(W) is δ||W||. The size of the added prefix set W_C (the set
of all unique prefixes of codewords of length less than δ, and all unique prefixes of
pairs of adjacent codewords with length greater than δ and less than 2δ) is at most
(n^2 + n)(δ - 1), as there are n codewords in total. Therefore, the length of W_C is
at most

n (1 + ... + (δ - 1)) + n^2 ((δ + 1) + (δ + 2) + ... + (2δ - 1)) = (3n^2 + n)(δ - 1)δ/2.

Since δ can be chosen to be O(log n), the size of the constructed set Ŵ is polynomial
in the size of W and the size of the original alphabet Σ. □
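The closed form in the length bound follows from the two arithmetic series 1 + ... + (δ - 1) = δ(δ - 1)/2 and (δ + 1) + ... + (2δ - 1) = 3δ(δ - 1)/2. A short numerical sanity check (function names are hypothetical) is:

```python
def summed_bound(n, delta):
    """Length bound written as the two explicit sums."""
    short = n * sum(range(1, delta))                  # prefixes of one codeword
    long_ = n * n * sum(range(delta + 1, 2 * delta))  # prefixes of codeword pairs
    return short + long_


def closed_form(n, delta):
    """The same bound as (3n^2 + n)(delta - 1)delta / 2."""
    return (3 * n * n + n) * (delta - 1) * delta // 2


agree = all(summed_bound(n, d) == closed_form(n, d)
            for n in range(1, 8) for d in range(1, 12))
```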

4.4 Equality-Free String Partition with Binary Alphabet 

Theorem 4. Equality-Free String Partition (EF-SP) Problem is NP-complete
for binary alphabet (L = 2).

Proof. We will show a reduction from the EF-MSP Problem with the binary
alphabet (L = 2) satisfying the properties listed in Theorem 3. Consider an instance
I of EF-MSP having a set of strings W = {w_1, w_2, ..., w_ℓ} over alphabet Σ =
{0, 1}, and maximum partition size K = 2δ such that all selected words in any
valid K-partition are prefixes of the elements of a set C^2, where C contains
n distinct strings of length δ each starting with 0, ℓ ≤ (n^2 + n)(δ - 1), and
δ ≥ max(9, 3 log_2(n + 1)). By Theorem 3, an instance of the EF-MSP with
maximum partition size K = 2 can be polynomially reduced to such an instance. We
will construct an instance Î of EF-SP over binary alphabet Σ̂ = {0, 1} with the
same partition size K = 2δ. We will show that the size of Î is polynomial in the
size of I, and hence, it will follow by Theorems 1 and 3 that the EF-SP Problem
is NP-complete.

To construct the string ŵ we will interleave strings w_1, ..., w_ℓ with delimiters
d_1, ..., d_{ℓ-1}, defined in a moment, as follows:

ŵ = w_1 d_1 w_2 d_2 w_3 ... d_{ℓ-1} w_ℓ.

To define the delimiter strings, we will need the following functions. Let bin :
N → {0, 1}* be a function mapping a positive integer to its standard binary
representation without the leading one. For example, bin(1) = ε, bin(2) = 0 and
bin(10) = 010. Next, the functions pad_i : {0, 1}* → {0, 1}* will pad a given
string with i - 1 ones and one zero on the left, i.e., pad_i(s) = (1)^{i-1}0s. We
will refer to strings returned by these functions as padded strings, and we write
pad_i^R(s) for the mirror image (reversal) of pad_i(s). The function
chain : {0, 1}* → {0, 1}* maps a string s with i trailing zeros, i.e., s = s'(0)^i,
where s' is either the empty string or a string ending with 1, to the following
concatenation of padded strings and mirror images (reversals) of padded strings:

chain(s) = pad^R_{K-|s|}(s) pad_{K-|s|}(s') pad^R_{K-|s|}(s') pad_{K-|s|}(s'0) pad^R_{K-|s|}(s'0)
           ... pad_{K-|s|}(s'(0)^{i-1}) pad^R_{K-|s|}(s'(0)^{i-1}) pad_{K-|s|}(s).

Finally, we set the delimiter d_j to chain(bin(j)), for every j > 1. For j = 1, we
set d_1 to 0(1)^{K-1}(1)^{K(K-1)/2}(1)^{K-1}0. To illustrate this definition, let us
list the first five delimiter strings:

d_1 = 0(1)^{K-1}(1)^{K(K-1)/2}(1)^{K-1}0

d_2 = chain(0) = 00(1)^{K-2}(1)^{K-2}00(1)^{K-2}(1)^{K-2}00

d_3 = chain(1) = 10(1)^{K-2}(1)^{K-2}01

d_4 = chain(00) = 000(1)^{K-3}(1)^{K-3}00(1)^{K-3}(1)^{K-3}0000(1)^{K-3}(1)^{K-3}000

d_5 = chain(01) = 100(1)^{K-3}(1)^{K-3}001
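The delimiter construction is easier to follow when executed. The sketch below implements bin, pad and chain as read from the definitions above; the Python function names (`bin_`, `pad`, `chain`, `delimiter`) and the explicit K argument are choices made here for illustration, and mirror images are taken with slice reversal.

```python
def bin_(j: int) -> str:
    """Binary representation of a positive integer without the leading one."""
    return format(j, "b")[1:]


def pad(i: int, s: str) -> str:
    """pad_i(s): i - 1 ones, one zero, then s."""
    return "1" * (i - 1) + "0" + s


def chain(s: str, K: int) -> str:
    """Concatenation of padded strings and their mirror images for s."""
    core = s.rstrip("0")           # s': s without its trailing zeros
    zeros = len(s) - len(core)     # i: number of trailing zeros
    width = K - len(s)             # the subscript K - |s|
    parts = [pad(width, s)[::-1]]  # mirror image of pad(s) comes first
    for t in range(zeros):         # s', s'0, ..., s'(0)^(i-1)
        p = pad(width, core + "0" * t)
        parts += [p, p[::-1]]
    parts.append(pad(width, s))    # pad(s) comes last
    return "".join(parts)


def delimiter(j: int, K: int) -> str:
    """d_1 is defined directly; d_j = chain(bin(j)) for j > 1."""
    if j == 1:
        return "0" + "1" * (K - 1) + "1" * (K * (K - 1) // 2) + "1" * (K - 1) + "0"
    return chain(bin_(j), K)
```

For instance, `delimiter(3, 6)` gives 10 followed by eight ones and 01, matching the form of d_3 with K = 6; K = 6 is used only to keep the strings short and is smaller than the values the proof requires.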

Now, consider a valid K-partition P of W. We construct a K-partition P̂
of ŵ as follows. Each substring w_j is partitioned in the same way as in P.
Each delimiter d_j, where j > 1, is partitioned into its padded strings and mirror
images of padded strings. In addition, the delimiter d_1 is partitioned into one
mirror image of a padded string, strings (1), (1)^2, ..., (1)^{K-1} in any order, and
one padded string. Note that all strings selected in the w_j's are prefixes of strings
in C^2, and since each c ∈ C has length δ = K/2 and starts with 0, all these selected
strings start with 0 and the longest run of ones they contain has length at most
δ - 1. Hence, they cannot collide with strings (1), (1)^2, ..., (1)^{K-1} and with
padded strings, which all start with 1. To show they do not collide with mirror
images of padded strings, we will show that each padded string (or its mirror
image) contains a run of at least δ ones. By the definition of functions pad_i, each
padded string or its mirror image selected in a delimiter d_j contains a substring
(1)^{K-|bin(j)|-1}, i.e., a run of K - (⌈log_2(j + 1)⌉ - 1) - 1 = K - ⌈log_2(j + 1)⌉
ones. Since j < ℓ ≤ (n^2 + n)(δ - 1), it is enough to show that
log_2[(n^2 + n)(δ - 1)] ≤ δ. This follows from the fact that
δ ≥ 2 log_2(n + 1) + δ/3 and δ/3 ≥ log_2(δ - 1) for δ ≥ 9. Finally, we need
to show that all selected padded strings and their mirror images are distinct.
Note that each selected padded string starts with at least δ ones and contains
at least one zero, hence, it cannot be equal to a selected mirror image of a padded
string. Hence, it is enough to show that two delimiters d_j and d_{j'}, where
j, j' < ℓ, do not contain the same padded string or its mirror image. Without loss
of generality, let us only consider the padded strings. If bin(j) and bin(j') have
different lengths then the padded strings of d_j and d_{j'} start with
(1)^{K-|bin(j)|-1}0 and (1)^{K-|bin(j')|-1}0, hence they cannot be equal. Therefore,
assume they have the same length. Let s (respectively, s') be the prefix of bin(j)
(respectively, bin(j')) without the trailing zeros. Clearly, s ≠ s'. Now, the padded
strings from d_j and d_{j'} are the same only if s(0)^i = s'(0)^{i'} for some i and i'.
However, since both s and s' end with one or one of them is the empty string, we
must have i = i', and hence also s = s', a contradiction. Since the K-partition P
of W was valid, it follows that the K-partition P̂ of ŵ is also valid.

Conversely, consider a valid K-partition P̂ of ŵ. It is enough to show that P̂
super-selects each delimiter in ŵ. We will show by induction on j that delimiters
d_1, ..., d_j are super-selected and furthermore, that each of these delimiters is
partitioned into its padded strings and mirror images of padded strings (a mirror
image, or reversal, is written with a superscript R). For
the base case j = 1, it is easy to see that P̂ must select string 0(1)^{K-1}, then
strings (1)^1, ..., (1)^{K-1} in any order and string (1)^{K-1}0, and thus d_1 is
super-selected in P̂ and its padded string and its mirror image of a padded string
are selected. Next, assume that the induction hypothesis is satisfied for delimiters
d_1, ..., d_{j-1}. Consider delimiter string d_j. First, we will show that d_j contains
cut points in P̂ shown by •'s below:

pad^R_{K-|s|}(s) • pad_{K-|s|}(s') pad^R_{K-|s|}(s') • pad_{K-|s|}(s'0) pad^R_{K-|s|}(s'0) •
... • pad_{K-|s|}(s'(0)^{i-1}) pad^R_{K-|s|}(s'(0)^{i-1}) • pad_{K-|s|}(s),

where s = bin(j), s' is the prefix of s without the trailing zeros, and i is the
number of trailing zeros. Note that each letter "•" is preceded and followed by
K - |bin(j)| - 1 ones. Since |bin(j)| ≤ δ - 1, we have a run of at least K = 2δ
ones, thus this run must contain a cut point. By contradiction, assume that there
is a cut point before the letter "•" in this run of ones. Then the selected string
starting at this cut point is of the form (1)^{K-i-1}0u, where i < |bin(j)| and
|u| ≤ i. Note that u might be the empty string and the selected string must
contain the zero preceding u since all strings consisting only of ones are already
selected in d_1. Let v = u(0)^{i-|u|}. Since |v| = i < |bin(j)|, we have v = bin(j'),
where j' < j. The delimiter string d_{j'} contains pad_{K-i}(u) = (1)^{K-i-1}0u, which
by the induction hypothesis has already been selected. Analogously, we arrive
at a contradiction if there is a cut point after "•" in the run of ones surrounding
the letter "•". It follows that there is a cut point at each letter "•" above in P̂.
Next, we show that each of the super-selected strings of d_j:

pad_{K-|s|}(s') pad^R_{K-|s|}(s'), ..., pad_{K-|s|}(s'(0)^{i-1}) pad^R_{K-|s|}(s'(0)^{i-1}),

has a cut point exactly in the middle. The length of each padded string or of
its mirror image is at least K - |s|, and since |bin(j)| ≤ δ - 1, this length is at
least δ + 1. Hence, there has to be at least one cut point in each of the above
super-selected strings in P̂. We will first prove the claim for the first
super-selected string pad_{K-|s|}(s') pad^R_{K-|s|}(s'). By contradiction, and without
loss of generality, assume that there is a cut point inside
pad_{K-|s|}(s') = (1)^{K-|s|-1}0s'. Thus a string of the form (1)^{K-|s|-1}0u, where u
is a proper prefix of s', is selected in P̂. Consider string v = u(0)^{|s|-|u|}.
Obviously, |v| = |s| and v is lexicographically smaller than s, and thus bin(j') = v
for some j' < j. By the induction hypothesis, string
pad_{K-|v|}(u) = (1)^{K-|v|-1}0u has already been selected in d_{j'}, a contradiction.
It follows by straightforward induction on i that the remaining super-selected
strings are partitioned exactly in the middle. Finally, observe that if there is a
cut point inside pad_{K-|s|}(s) then either one of the padded strings of d_{j'} or one
of the padded strings of d_j described above is selected again. Similarly, there
cannot be any cut point inside pad^R_{K-|s|}(s). Since the length of these two
strings is exactly K, there has to be a cut point just after pad_{K-|s|}(s) and just
before pad^R_{K-|s|}(s), i.e., d_j is super-selected. This completes the induction
proof, and we have that all delimiter strings in ŵ are super-selected by P̂, and
thus P̂ also gives us a partition of the set W.

It follows that there is a K-partition of W if and only if there is a K-partition
of ŵ. Finally, let us check that the reduction is polynomial. The length of each
padded string or its mirror image is at most K. The length of d_1 is K(K + 3)/2 ≤
K^2. String bin(j) for 1 < j < ℓ has length at most δ - 1, and hence each d_j
contains at most 2δ = K padded strings and mirror images of padded strings.
Hence, |d_j| ≤ K^2. Thus, the total length of ŵ is at most ||W|| + ℓK^2. □

5 Factor-, Prefix- and Suffix-Free String Partition
Problems

Here, we summarize the results for these partition problems. Their proofs can be
found in the appendix of this paper.

Theorem 5. Both Factor-Free Multiple String Partition (FF-MSP) and Factor-Free
String Partition (FF-SP) are NP-complete in the following two cases: (a)
when the maximum partition size is 3; and (b) when the alphabet is binary.

Theorem 6. Both Prefix (Suffix)-Free Multiple String Partition (PF-MSP) and
Prefix (Suffix)-Free String Partition (PF-SP) are NP-complete in the following
two cases: (a) when the maximum partition size is 2; and (b) when the alphabet
is binary.

6 Conclusion 

We have established the complexity of the following fundamental question: given
a string w over an alphabet Σ and an integer K, can w be partitioned into
factors no longer than K such that no two collide? We have shown this problem
is NP-complete for versions requiring that no string in the partition is a
copy/factor/prefix/suffix of another. Furthermore, we have shown the problems
remain hard even for binary strings. This resolves a number of open questions
from previous work [4] and establishes the theoretical hardness of a practical 
problem in contemporary synthetic biology, specifically, the oligo design for gene 
synthesis problem. 
A Factor-Free String Partition Problems



A.1 Factor-Free Multiple String Partition with Unbounded
Alphabet

Let φ be an instance of 3SAT(3), with set C = {c_1, ..., c_m} of clauses, and set
X = {x_1, ..., x_n} of variables. We shall define an alphabet Σ and construct a set
of strings W over Σ* such that W has a factor-free 3-partition if and only if
φ is satisfiable. Let |c_i| denote the number of literals contained in the clause c_i
and let c_i^1, ..., c_i^{|c_i|} be the literals of clause c_i.

We construct W to be the union of three sets of strings: clause strings (C),
enforcer strings (E) and forbidden strings (F), with the same function as in the
equality-free case. We construct an alphabet Σ, formally defined below, which
includes a letter for each literal occurrence in the clauses, three letters for each
variable, and the letters 0 and 1.

Σ = {x_v^j : x_v ∈ X ∧ 1 ≤ j ≤ 3} ∪ {c_i^j : c_i ∈ C ∧ 1 ≤ j ≤ |c_i|} ∪ {0, 1}

Note that |Σ| is linear in the size of the 3SAT(3) instance φ (at most
3m + 3n + 2).

Observation 2. Since every letter appears at least twice in W, no selected string
can be a single letter.



[Figure 6 graphic: one forbidden string per variable x_v, for 1 ≤ v ≤ n.]

Fig. 6. The set of forbidden strings, F, used in the reduction from 3SAT(3) to FF-MSP.



Construction of forbidden strings: To ensure that certain strings cannot be
selected in C or E, we construct a set of forbidden strings F as shown in Figure 6.
Specifically, we forbid, for every variable x_v, any factor of a three-letter string
consisting of a 0 between two variable letters of x_v. The number of strings in F
is n.

Lemma 3. No string or factor of a string from the forbidden set F can be
selected in C or E.

Proof. Consider any string f ∈ F. If f is split into two or three selected strings,
a single letter is selected, which is not possible. Regardless of the construction of
C and E, it follows that in any valid partition, since f is selected in F, a factor
of f cannot be selected in C nor in E. □



Fig. 7. The 2-literal clause string (left) and 3-literal clause string (right) used in the 
reduction from 3SAT(3) to FF-MSP. Shown below each string are all valid partitions. 
Selected literals for a partition are shown in red. 

Construction of clause strings: For each clause c_i ∈ C, construct the i-th clause
string to be c_i^1 c_i^1 0 c_i^2 c_i^2, if |c_i| = 2, and
c_i^1 c_i^1 0 c_i^2 c_i^2 0 c_i^3 c_i^3, if |c_i| = 3.

Lemma 4. Given that no factor of a string from the forbidden set F is selected
in C ∪ E, at least one literal must be selected for each clause string in any
factor-free 3-partition of W.

Proof. A literal letter cannot be selected alone without creating a collision.
Therefore, we say a literal c_i^j is selected in clause c_i if and only if the string
c_i^j c_i^j is selected in the clause string for c_i. Whether c_i has two or three
literals, a single letter cannot be selected. A simple case analysis shows that in
any valid partition at least one literal is selected (see Figure 7). □
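The case analysis behind Lemma 4 can be replayed mechanically. The following is a minimal sketch, not the authors' procedure: it assumes only that parts have length 2 or 3 (no single letters, by Observation 2, and at most 3 by the partition bound), and it ignores collisions with the rest of W, so it enumerates a superset of the partitions in Figure 7. The function names and the literal labels "a", "b", "c" are illustrative.

```python
def compositions(n, sizes):
    """Yield every ordered way to write n as a sum of values taken from sizes."""
    if n == 0:
        yield ()
        return
    for s in sizes:
        if s <= n:
            for rest in compositions(n - s, sizes):
                yield (s, *rest)


def clause_string(literals):
    """Build the clause string: each literal letter doubled, separated by '0'."""
    out = []
    for k, lit in enumerate(literals):
        if k:
            out.append("0")
        out += [lit, lit]
    return out


def partitions(symbols, sizes=(2, 3)):
    """Yield every split of symbols into consecutive parts with allowed sizes."""
    for comp in compositions(len(symbols), sizes):
        parts, pos = [], 0
        for s in comp:
            parts.append(tuple(symbols[pos:pos + s]))
            pos += s
        yield parts


def selected_literals(parts, literals):
    """A literal counts as selected when its doubled letter is itself a part."""
    return {lit for lit in literals if (lit, lit) in parts}
```

Under these assumptions every enumerated partition of a 2-literal or 3-literal clause string selects at least one literal, which is the claim of the lemma.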

Construction of enforcer strings: We must now ensure that no literal of φ that is
selected in C is the negation of another selected literal. By definition of 3SAT(3),
each variable appears exactly three times: twice positive and once negated. For
the two positive occurrences and the negated occurrence of variable x_v,
construct three enforcer strings as shown in Figure 8.

[Figure 8 graphic: the three enforcer strings for variable x_v, each drawn above its valid partitions.]

Fig. 8. The enforcer strings for the three literals of variable Xy used in the reduction 
from 3SAT(3) to FF-MSP. The two positive literals are denoted as and and the 
negative literal as c^. If the negative literal is selected in C, then the enforcer string 
ensures neither positive literal can also be selected in C without creating a collision 
(top row). Likewise, if either or both of the positive literals are selected in C, then the 
negative literal cannot be selected without creating a collision (bottom row). 



Lemma 5. Given that no factor of a string from the forbidden set F is selected
in C ∪ E, any factor-free 3-partition of W must be consistent.

Proof. Consider the three enforcer strings, shown in Figure 8, for some variable
x_v whose two positive literals equal x_v and whose negated literal equals ¬x_v.
Note that the red strings in the middle of the last two enforcer strings are
forbidden, and hence no partition can have a cut point at the beginning or the
end of the red
string. Note also that the factor x^x^ cannot be selected, as then x^x^O or Ox^x^ 
has to be selected, which is obviously not possible. Suppose the negative literal
is selected in C. Then the only partition which can be selected without creating a
collision selects strings containing the doubled letters of both positive literals as
factors, thus forbidding them from being selected in C (see Figure 8 (top
partition)). Likewise, suppose one or both of the positive literals is selected in C.
Then only one partition of the first enforcer string is possible and it selects one
string containing the doubled letter of the negative literal as a factor, thus
forbidding the negative literal from being selected in C (see Figure 8 (bottom
partition)). Note that while these enforcer strings ensure that the literals selected
in the clauses are consistent, they also ensure that unwanted collisions do not
occur, since selected strings containing literals are prefixed or suffixed by a 1, a
letter not used in the clause strings. Also note that the selected strings containing
the variable letters do not collide. □

This completes the reduction. Notice that the reduction is polynomial as the
combined length of the constructed set of strings W = C ∪ E ∪ F is at most
8m + 32n + 3n = 8m + 35n.

Theorem 7. Factor-Free Multiple String Partition (FF-MSP) is NP-complete. 

Proof. It is easy to see that FF-MSP is in NP: a nondeterministic algorithm
need only guess a partition P where |p_i| ≤ K for all p_i in P and check in
polynomial time that no string in P is a factor of another. Furthermore, it is
clear that an arbitrary instance φ of 3SAT(3) can be reduced to an instance of
FF-MSP, specified by a set of strings W = C ∪ E ∪ F, in polynomial time and
space by the reduction detailed above.
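The polynomial-time check used in the NP-membership argument can be sketched as follows. This is an illustrative verifier under the assumption that letters are single characters of a Python string; the names `split_at` and `is_valid_partition` are not from the paper. The collision predicate is a parameter so the same quadratic check covers the factor-free, prefix-free and suffix-free variants.

```python
from typing import Callable, Sequence


def is_factor(a: str, b: str) -> bool:
    """True if a occurs as a contiguous substring of b."""
    return a in b


def is_prefix(a: str, b: str) -> bool:
    return b.startswith(a)


def is_suffix(a: str, b: str) -> bool:
    return b.endswith(a)


def split_at(w: str, cuts: Sequence[int]) -> list[str]:
    """Split w at the given cut points (positions strictly inside w)."""
    bounds = [0, *sorted(cuts), len(w)]
    return [w[i:j] for i, j in zip(bounds, bounds[1:])]


def is_valid_partition(parts: Sequence[str], K: int,
                       collides: Callable[[str, str], bool] = is_factor) -> bool:
    """Check the length bound and that no part collides with a different part."""
    if any(len(p) == 0 or len(p) > K for p in parts):
        return False
    for i, a in enumerate(parts):
        for j, b in enumerate(parts):
            if i != j and collides(a, b):
                return False
    return True
```

For a multiple string partition, the parts of all strings are simply concatenated into one list before calling `is_valid_partition`. Equal parts count as a collision under all three predicates.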

Now suppose there is a satisfying truth assignment for φ. Simply select one
corresponding true literal per clause in C. The construction of clause strings
guarantees that a 3-partition of the rest of each clause string is possible. Also,
since a satisfying truth assignment for φ cannot assign truth values to opposite
literals, Lemma 5 guarantees that a valid partition of the enforcer strings
is possible which does not conflict with the clause strings. Therefore, there
exists a factor-free multiple string partition of W.

Likewise, consider a factor-free multiple string partition of W. Lemma 4
ensures that at least one literal per clause is selected. Furthermore, Lemma 5
guarantees that if there is no collision, then no two selected literals in the
clauses are negations of each other. Therefore, this must correspond to a
satisfying truth assignment for φ (if none of the three literals of a variable is
selected in the partition of C then this variable can have an arbitrary value in the
truth assignment without affecting satisfiability of φ). □

A.2 Factor-Free String Partition with Unbounded Alphabet

Lemma 6. A valid K-partition P of a string w having α(x)^{3K−2}β as a factor,
where α and β are single letters other than x, must select the strings α(x)^{K−1},
(x)^K, and (x)^{K−1}β.

Proof. Three strings are required to cover the factor δ = α(x)^{3K−2}β. However,
more than three strings cannot be selected to cover δ, as otherwise at least two
factors are selected consisting only of the letter x and must therefore collide.
There is only one partition of δ covered by three factors, which must select
α(x)^{K−1}, (x)^K, and (x)^{K−1}β. □
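For small K the forced partition of Lemma 6 can be confirmed by exhaustive search. The sketch below is an illustration under a simplifying assumption: it treats δ = α x^{3K−2} β as a standalone string (with "a" and "b" standing in for α and β), whereas the lemma concerns δ as a factor of a longer string. The helper names are hypothetical.

```python
def factor_free(parts):
    """True if no part is a factor (substring) of a different part."""
    return not any(
        i != j and a in b
        for i, a in enumerate(parts)
        for j, b in enumerate(parts)
    )


def k_partitions(w, K):
    """Yield every split of w into consecutive non-empty parts of length <= K."""
    if not w:
        yield []
        return
    for n in range(1, min(K, len(w)) + 1):
        for rest in k_partitions(w[n:], K):
            yield [w[:n], *rest]


def forced_partitions(K):
    """All factor-free K-partitions of a x^(3K-2) b taken as a whole string."""
    delta = "a" + "x" * (3 * K - 2) + "b"
    return [p for p in k_partitions(delta, K) if factor_free(p)]
```

For each small K tried, the search returns exactly one partition, the one named in the lemma.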

Theorem 8. Factor-Free String Partition (FF-SP) is NP-complete. 

Proof. Consider an arbitrary instance I of FF-MSP having a set of strings
W = {w_1, w_2, ..., w_n} over alphabet Σ and maximum partition size K. We
construct an instance I' of FF-SP as follows. Introduce the new letters α and γ_i,
for 1 ≤ i < n, none of which occurs in Σ. Set the alphabet of I' to
Σ' = Σ ∪ {α} ∪ {γ_i : 1 ≤ i < n} and the maximum partition size to K' = K.
Note that |Σ'| = |Σ| + n. Finally, construct the string
w' = w_1 α(γ_1)^{3K−2}α w_2 α(γ_2)^{3K−2}α ... α(γ_{n−1})^{3K−2}α w_n.
The reduction is in polynomial time and space as
|w'| = Σ_{w ∈ W} |w| + 3K(n − 1). By Lemma 6, the factors α(γ_i)^{3K−2}α of
string w' must be partitioned as α(γ_i)^{K−1}, (γ_i)^K, (γ_i)^{K−1}α, for
1 ≤ i < n. Since any string containing a letter γ_i, 1 ≤ i < n, cannot be a factor
of any string in W, it follows immediately that w' has a K'-partition if and only
if W has a K-partition. □
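The construction of w' from W can be sketched directly. This is an illustrative sketch: strings are represented as lists of hashable symbols so the alphabet can be unbounded, and the tuples `("alpha",)` and `("gamma", i)` are assumed stand-ins for the new letters α and γ_i, chosen so they cannot clash with plain string symbols.

```python
def join_with_delimiters(strings, K):
    """Concatenate the strings of W, inserting alpha gamma_i^(3K-2) alpha
    between consecutive strings w_i and w_(i+1)."""
    alpha = ("alpha",)
    out = []
    for i, w in enumerate(strings, start=1):
        out.extend(w)
        if i < len(strings):
            gamma = ("gamma", i)  # a fresh letter for the i-th delimiter
            out.append(alpha)
            out.extend([gamma] * (3 * K - 2))
            out.append(alpha)
    return out
```

Each delimiter has length 3K, so the output length is the total length of W plus 3K(n − 1), matching the size bound in the proof.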

A.3 Factor-Free Multiple String Partition with Binary Alphabet

In this section we are going to reduce the size of the alphabet to 2. In order to do
that we will map all letters of the original unbounded alphabet Σ except 0 and
1 to distinct binary strings of length t (t has to be large enough so that we
have enough strings), called codewords. Letters 0 and 1 will remain mapped
to 0 and 1, respectively. Consequently, K will be set to 2t + 1. We will use the
same clause strings and a simplified version of the enforcer strings found in the
unbounded case (just mapped to the binary alphabet). We will use only two
forbidden strings, 000 and 010, to force valid K-partitions to cut the clause and
enforcer strings just before or after the 0 or 1 letter. At the end, we will show
that this does not introduce any new collisions in the K-partition corresponding
to a truth assignment of the 3SAT(3) instance.

Construction of codewords: We will use codewords of the following type:
0(1)^i 0(1)^{t−3−i} 0, where i ∈ {2, ..., t − 5}. To make sure we have enough
codewords for all literal and variable letters (at most 3m + 2n), we have to
choose t ≥ 3m + 2n + 6.
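A short sketch of the codeword family makes the counting explicit. It assumes the form 0 1^i 0 1^{t−3−i} 0 with i ranging over 2, ..., t − 5, as reconstructed above; the function name is illustrative.

```python
def codewords(t):
    """Return the binary codewords 0 1^i 0 1^(t-3-i) 0 for i = 2, ..., t-5."""
    return ["0" + "1" * i + "0" + "1" * (t - 3 - i) + "0"
            for i in range(2, t - 4)]
```

There are t − 6 such codewords, each of length t, which is why t ≥ 3m + 2n + 6 suffices. Both runs of 1s have length at least 2, so neither forbidden string 000 nor 010 occurs inside a codeword.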

Construction of forbidden strings: We will use only two forbidden strings, F =
{000, 010}. Obviously, the only factor-free partition of F is without any
non-trivial cut points. These two forbidden strings force any string containing uσv
as a factor, where u and v are codewords and σ is a letter, to contain exactly
one cut point around the letter σ between u and v, as formalized in the following
lemma.



Lemma 7. Any valid K-partitioning of αuσvβ and F, where u, v are codewords,
σ ∈ {0, 1} and α, β are arbitrary binary strings, contains either cut point |α| + |u|
or |α| + |u| + 1, but not both.

Proof. Assume we have a valid partitioning of αuσvβ without cut points at
positions |α| + |u| and |α| + |u| + 1. Since u ends with 0 and v starts with 0,
there is a selected string containing 0σ0 as a factor, which is a forbidden string,
a contradiction. □



Construction of clause strings: We will use the same clause strings as for the
unbounded alphabet:

c_i^1 c_i^1 0 c_i^2 c_i^2 or c_i^1 c_i^1 0 c_i^2 c_i^2 0 c_i^3 c_i^3,

where the c_i^j are distinct codewords described above. We say that a literal c_i^j is
selected if c_i^j c_i^j is super-selected in the K-partition.

Lemma 8. In any factor-free K-partition of C ∪ F, at least one of the literals
in each clause is selected.

Proof. By Lemma 7, there is exactly one cut point around each of the 0's between
codewords in the clause string. This means that the K-partition of the clause
string follows one of the patterns depicted in Figure 7 with the exception that
any of the selected strings depicted in the figure can be further partitioned, i.e.,
they are super-selected. The claim follows. □

Construction of enforcer strings: We will use a slightly simplified version of the 
enforcer strings as those used for the unbounded alphabet: 

xlxllclcll, Ic^cf 1x^x^0 and Oxix^lcjcjl, 

where and are codewords for the positive literals of a variable , IS 
the codeword for the negated literal of Xy and X y , X y are codewords for the 
variable letters of Xy . The difference from the unbounded case is that the factor 
x^x^^x^x^l can be safely removed from the second and third strings without 
changing the logic of the gadgets due to the property described in Lemma 7. 

Observation 3. In any factor-free K-partition no super-selected string can be a
prefix (suffix) of any other super-selected string.

Lemma 9. In any factor-free K-partition of C ∪ E ∪ F, the selected literals are
consistent.

Proof. By Lemma 7, either x_v^1 x_v^2 or the doubled codeword of the negated
literal followed by 1 is super-selected in the first enforcer string. Note that if
x_v^1 x_v^2 is super-selected then, by Observation 3, x_v^1 x_v^2 0 and
0 x_v^1 x_v^2 cannot be super-selected. Hence, if either x_v^1 x_v^2 0 or
0 x_v^1 x_v^2 is super-selected, then in the first enforcer string the doubled
codeword of the negated literal followed by 1 is super-selected, and hence the
negated literal cannot be selected.

By Lemma 7, either 1 followed by the doubled codeword of a positive literal, or
x_v^1 x_v^2 0, is super-selected in the second enforcer string. In the first case, by
Observation 3, that positive literal cannot be selected; in the second case, by the
above argument, the negated literal cannot be selected. Similarly, the last
enforcer string ensures that the other positive literal and the negated literal
cannot be selected at the same time. □



Theorem 9. Factor-Free Multiple String Partition Problem for the binary
alphabet (FF-MSP(2)) is NP-complete.

Proof. It follows by Lemmas 8 and 9 that if there is a factor-free K-partition of
W = C ∪ E ∪ F, then the selected literals produce a satisfying assignment for the
3SAT(3) instance φ.

Now, assume that there is a satisfying assignment for φ. Select a literal in
each clause which satisfies it and partition all clause and enforcer strings
accordingly, ensuring literals selected in the clause strings are not selected in the
enforcer strings. We will show that this K-partition P is factor-free. Obviously,
the forbidden strings 000 and 010 are not factors of any selected string. In the
clause strings, the K-partition P selects the following types of strings: 0aa, aa0
and aa, where a is a codeword. In the enforcer strings it selects the following
types of strings: 1aa, aa1, 0ab, ab0, ab1, ab, 0a, a0, 1a, a1, where a, b are distinct
codewords. It follows by the proof in the unbounded case that two strings where
one is obviously a factor of the other, like aa and 0aa, are not selected at the
same time. Hence, it is enough to show that no new collisions are introduced by
mapping the original letters to the binary alphabet.

Obviously, the strings of the same length are different as all codewords are
different and the strings of different types (e.g., 0ab and ab0) would differ in the
first two or last two letters. String ab (where we can have a = b) is not a factor
of any σcd (cdσ), where a, b, c, d are codewords and σ ∈ {0, 1}, as the 00
in the middle of ab would have to exactly match 00 in the middle of σcd (cdσ),
which would imply ab = cd.

Finally, we will show that σa (and similarly, aσ) is not a factor of cd,
σ'cd or cdσ', where a, c, d are codewords and σ, σ' are letters, unless σ = σ'
and a = c. First, assume that σ = 1. Then σa contains two 101 factors with
only 1's between them. String cd contains exactly two 101 factors but there is
a 00 factor between their occurrences in cd, and the same is true for σ'cd and
cdσ' if σ' = 0. Hence, assume that σ' = 1 = σ. Now, cdσ' contains a factor
10(1)^+01, but only at the very end, while in σa this factor is followed by at least
two letters. Hence, the only possibility is that σa is a factor of σ'cd. However,
pattern 10(1)^+01 appears only at the beginning of σ'cd, and hence σa would
have to be a prefix of σ'cd and then a = c.

Second, assume that σ = 0. Then σa starts with 00. A similar case analysis
would show that σa can only be a factor of σad, or of ca, σ'ca or caσ'. However,
string 0a can be only selected from the second enforcer string 0xy1bb1 and x
never appears in the second position of any selected string of the type cd, σ'cd,
cdσ'. □



A.4 Factor-Free String Partition with Binary Alphabet



We will first design a sequence of strings which have to be selected no matter 
where they appear in the string we are partitioning. 

Lemma 10. Let K > 1 and for any i < K, let d_i = (1)^i 0 (1)^{K−1−i}. Then any
factor-free K-partition of

w = u_1 d_0 (1)^K d_0^R u_2 d_1 d_1^R ... d_{N−2} d_{N−2}^R u_N,

where N < K/2 and u_1, ..., u_N are arbitrary strings, selects the following strings:
(1)^K, d_i, d_i^R, for every i = 0, ..., N − 2.

Proof. Let P be a factor-free K-partition of w. We will show by induction
on i that (1)^K, d_0, ..., d_i and d_0^R, ..., d_i^R are selected. The base case i = 0
follows by Lemma 6. For the inductive step, assume that (1)^K, d_0, ..., d_{i−1} and
d_0^R, ..., d_{i−1}^R are selected, where i ≤ N − 2. We will show that d_i and d_i^R
are also selected. The factor d_i d_i^R of w has length 2K, hence, there is at least
one cut point inside it. Let j be the first such cut point. Obviously, j ≤ K. Assume
that j < K. Let w_p be the selected string starting at cut point j. We will consider
two cases:

Case 1. j ≥ i + 1. Then w_p is a prefix of (1)^K, a contradiction since (1)^K is
already selected.

Case 2. j ≤ i. Then w_p is a prefix of d_{i−j}, a contradiction since d_{i−j} is already
selected.

Hence, the first cut point inside the factor d_i d_i^R of w is at position K. By a
symmetrical argument, this is also the last such cut point. It follows that both
d_i and d_i^R are selected. □

Corollary 1. Let K > 1 and for any i < K, let d_i = (1)^i 0 (1)^{K−1−i}. Consider
the string

w = u_1 d_0 (1)^K d_0^R u_2 d_1 d_1^R ... d_{N−2} d_{N−2}^R u_N,

where N < K/2 and u_1, ..., u_N are arbitrary strings. If the string w has a
factor-free K-partition then the sequence of strings u_1, ..., u_N has a factor-free
multiple K-partition. On the other hand, if the sequence u_1, ..., u_N has a
factor-free multiple K-partition such that each selected string contains at least two 0s
then w has a factor-free K-partition.

Proof. The first implication follows immediately by Lemma 10. The second
implication follows by the fact that each delimiter contains only one 0; hence, none
of the selected strings in u_1, ..., u_N can be a factor of a delimiter. □
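The string w of Lemma 10 and Corollary 1 can be assembled as in the sketch below. It assumes the delimiter form d_i = 1^i 0 1^{K−1−i} reconstructed above and enforces the stated bound N < K/2 strictly; the function names are illustrative. It only builds w and does not check the corollary's side condition that every selected string contains at least two 0s.

```python
def delimiter(i, K):
    """d_i = 1^i 0 1^(K-1-i): a length-K string with a single 0 at index i."""
    return "1" * i + "0" + "1" * (K - 1 - i)


def join_binary(us, K):
    """Build u_1 d_0 1^K d_0^R u_2 d_1 d_1^R ... d_(N-2) d_(N-2)^R u_N."""
    N = len(us)
    if not N < K / 2:
        raise ValueError("the construction assumes N < K/2")
    out = []
    for k, u in enumerate(us):
        out.append(u)
        if k < N - 1:
            d = delimiter(k, K)
            out.append(d)
            if k == 0:
                out.append("1" * K)  # only the first delimiter pair wraps 1^K
            out.append(d[::-1])      # mirror image d_k^R
    return "".join(out)
```

The first delimiter block is 0 1^{3K−2} 0, which is exactly the pattern handled by Lemma 6 and serves as the base case of the induction.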

As the immediate consequence of Theorem 9 and Corollary 1, we have the 
following. 

Theorem 10. Factor-free String Partition Problem for the binary alphabet (FF- 
SP(2)) is NP-complete. 



B Prefix/Suffix-Free String Partition Problems



All proofs presented in this section are for the prefix-free string partition
problems. The results for the suffix-free string problems follow by symmetry.

B.1 Prefix/Suffix-Free Multiple String Partition with Unbounded
Alphabet

Similar to the equality-free and factor-free cases, we will show a polynomial
reduction from an arbitrary instance of 3SAT(3).

Let φ be an instance of 3SAT(3), with set C = {c_1, ..., c_m} of clauses, and
set X = {x_1, ..., x_n} of variables. We shall define an alphabet Σ and construct a
set of strings W over Σ*, such that W has a prefix-free 2-partition if and only
if φ is satisfiable. Let |c_i| denote the number of literals contained in the clause
c_i and let c_i^1, ..., c_i^{|c_i|} be the literals of clause c_i.

We construct W to be the union of three sets of strings: clause strings (C),
enforcer strings (E) and forbidden strings (F), with the same function as in
the equality-free and factor-free cases. We construct an alphabet Σ, formally
defined below, which includes four letters for each variable, a letter for each
literal occurrence in the clauses and the letter $.

Σ = {x_v^j : x_v ∈ X ∧ 1 ≤ j ≤ 4} ∪ {c_i^j : c_i ∈ C ∧ 1 ≤ j ≤ |c_i|} ∪ {$}

Note that |Σ| is linear in the size of the 3SAT(3) problem φ (at most
4n + 3m + 1).

Construction of forbidden strings: The forbidden set, F, consists of the single
string $$. Without loss of generality, we refer to this as the forbidden string.

Lemma 11. No factor of the forbidden string can be selected in C nor in E.

Proof. No proper factor of the forbidden string can be selected without creating
a collision. Therefore, the entire string must be selected. Regardless of the
construction of C and E, it follows that in any valid partition, since $$ is selected in
F, a factor of it cannot be selected in C nor in E. □

Construction of clause strings: For each clause c_i ∈ C, construct the i-th clause
string to be c_i^1 $ c_i^2, if |c_i| = 2, and c_i^1 $ c_i^2 $ c_i^3, if |c_i| = 3.

Lemma 12. Given that no factor of the forbidden string is selected in C ∪ E,
exactly one literal must be selected for each clause string in any prefix-free
2-partition of W.

Proof. We say a literal c_i^j is selected in clause c_i if and only if the string c_i^j
is selected in the clause string for c_i. Whether c_i has two or three literals, the
letter $ cannot be selected alone. A simple case analysis shows that
in any valid partition exactly one literal is selected (see Figure 9). □
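The case analysis for the prefix-free clause gadget can also be enumerated. The sketch below is an illustration under two assumptions: a literal counts as selected exactly when its letter is chosen as a single-letter part (this reading is inferred from the gadget, not quoted from the paper), and only collisions within one clause string plus the forbidden string $$ are checked. Literal names are single characters chosen for illustration.

```python
def k_partitions(w, K):
    """Yield every split of w into consecutive non-empty parts of length <= K."""
    if not w:
        yield []
        return
    for n in range(1, min(K, len(w)) + 1):
        for rest in k_partitions(w[n:], K):
            yield [w[:n], *rest]


def prefix_free(parts):
    """True if no part is a prefix of a different part (equal parts collide)."""
    return not any(
        i != j and b.startswith(a)
        for i, a in enumerate(parts)
        for j, b in enumerate(parts)
    )


def valid_clause_partitions(literals, K=2):
    """2-partitions of the clause string that stay prefix-free next to '$$'."""
    clause = "$".join(literals)
    for parts in k_partitions(clause, K):
        if prefix_free(parts + ["$$"]):
            yield parts


def selected_literals(parts, literals):
    """Literals whose letter appears as a part on its own."""
    return [lit for lit in literals if lit in parts]
```

Under these assumptions a 2-literal clause admits two valid partitions and a 3-literal clause admits three, and each selects exactly one literal.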



Fig. 9. The 2-literal clause gadget (left) and 3- literal clause gadget (right) used in the 
reduction from 3SAT(3) to PF-MSP. Shown below each gadget are all valid partitions. 
Selected literals of a partition are shown in red. 



Construction of enforcer strings: We must now ensure that no literal of φ that is
selected in C is the negation of another selected literal. By definition of 3SAT(3),
each variable appears exactly three times: twice positive and once negated. For
the two positive occurrences and the negated occurrence of a variable x_v,
construct two enforcer strings as shown in Figure 10.



Fig. 10. The pair of enforcer strings for a variable Xv used in the reduction from 
3SAT(3) to PF-MSP. The two positive literals for variable Xv are denoted as and 
and the negative literal as . 



Lemma 13. Given that no factor of the forbidden string is selected in C ∪ E,
any prefix-free 2-partition of W must be consistent.

Proof. Consider the two enforcer strings, shown in Figure 10, for a variable x_v
whose two positive literals equal x_v and whose negated literal equals ¬x_v.

Suppose the negative literal is selected in C. Then the only partition which
can be selected without creating a collision selects strings containing each of the
two positive literals as a prefix, thus forbidding them from being selected in C
(see Figure 10 (top row)).

Likewise, if one or both of the positive literals is selected in C then in any
collision-free 2-partition a string is selected containing the negative literal as a
prefix, thus forbidding the negative literal from being selected in C (see Figure 10
(bottom row)). □

This completes the reduction. Notice that the reduction is linear as the combined
length of the constructed set of strings W = C ∪ E ∪ F is at most 5m + 8n + 2.

Theorem 11. Prefix (Suffix) -Free Multiple String Partition (PF-MSP) is NP- 
complete. 

Proof. It is easy to see that PF-MSP is in NP: a nondeterministic algorithm
need only guess a partition P where |p_i| ≤ K for all p_i in P and check in
polynomial time that no string in P is a prefix of another. Furthermore, it is
clear that an arbitrary instance of 3SAT(3) can be reduced to an instance of
PF-MSP, specified by a set of strings W = C ∪ E ∪ F, in polynomial time and
space by the reduction detailed above.

Now suppose there is a satisfying truth assignment for φ. Simply select one
corresponding true literal per clause in C. The construction of clause strings
guarantees that a 2-partition of the rest of each clause string is possible. Also,
since a satisfying truth assignment for φ cannot assign truth values to opposite
literals, Lemma 13 guarantees that a valid partition of the enforcer strings
is possible. Therefore, there exists a prefix-free multiple string partition of W.

Likewise, consider a prefix-free multiple string partition of W. Lemma 12
ensures that exactly one literal per clause is selected. Furthermore, Lemma 13
guarantees that if there is no collision, then no two selected literals in the
clauses are negations of each other. Therefore, this must correspond to a
satisfying truth assignment for φ (if none of the three literals of a variable is
selected in the partition of C then this variable can have an arbitrary value in the
truth assignment without affecting satisfiability of φ). □

B.2 Prefix/Suffix-Free String Partition with Unbounded Alphabet

To show the single string restriction of this problem is NP-complete, we design 
the same delimiter strings as specified in the factor-free construction. The result 
follows immediately by Theorem 11 and Lemma 6. 

Theorem 12. Prefix (Suffix) -Free String Partition (PF-SP) is NP-complete. 

B.3 Prefix/Suffix-Free Multiple String Partition with Binary
Alphabet

In this section we start with the same construction as the multiple string
unbounded alphabet case to form a set of strings W, but show how the letters can
be encoded into binary. We map the $ letter to 1 and map all other letters of
the original unbounded alphabet Σ to distinct binary strings of length t, called
codewords. Consequently, K will be set to 2t. We will establish that no codeword
can properly contain a cut point. Furthermore, by design, no codeword is a prefix
of another. Since the mapping to binary does not introduce new collisions, and
since codewords cannot be cut in the middle, the correctness of the construction
will follow from the results on the unbounded case.

Construction of codewords: We use codewords of the form 00(1)^i 0(1)^{t−4−i} 0,
where i ∈ {2, ..., t − 6}. To ensure we have enough codewords for all literal and
variable letters (at most 3m + 4n), we have to choose t ≥ 3m + 4n + 7.

Construction of forbidden string: We will use the following set of forbidden 
strings: {11, 01, 101, 0001, 10001}. Considering only the forbidden set, a simple 
case analysis shows that each forbidden string must be entirely selected, other- 
wise a collision occurs. This set of forbidden strings ensures that no codeword 
can be cut in the middle. 
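The "simple case analysis" on the forbidden set is small enough to run exhaustively. The sketch below is an illustration that considers only the five forbidden strings, as the text does, and applies no length bound (all five are shorter than K = 2t in the construction); the helper names are hypothetical.

```python
from itertools import product

FORBIDDEN = ["11", "01", "101", "0001", "10001"]


def all_splits(w):
    """Yield every split of w into consecutive non-empty parts."""
    if not w:
        yield []
        return
    for n in range(1, len(w) + 1):
        for rest in all_splits(w[n:]):
            yield [w[:n], *rest]


def prefix_free(parts):
    """True if no part is a prefix of a different part (equal parts collide)."""
    return not any(
        i != j and b.startswith(a)
        for i, a in enumerate(parts)
        for j, b in enumerate(parts)
    )


def surviving_partitions(strings):
    """Try every combination of splits and keep the prefix-free ones."""
    survivors = []
    for combo in product(*(list(all_splits(s)) for s in strings)):
        selected = [p for parts in combo for p in parts]
        if prefix_free(selected):
            survivors.append(combo)
    return survivors
```

Of the 2048 combinations of splits, only the one that selects every forbidden string whole is prefix-free, in line with the claim above.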



Lemma 14. Given that no strings selected in C ∪ E have a forbidden prefix, any
prefix-free K-partition of W must not contain a cut point within a codeword.

Proof. Recall that codewords are of length t and that all length-two binary
strings are prefixes of a forbidden string and therefore cannot be selected. Let us
consider any cut point beginning within an arbitrary codeword w. Any proper
suffix of w longer than two has a prefix in the set {011, 111, 101, 110}.
Each of these contains a forbidden string as a prefix and therefore a cut point
in w cannot begin prior to positions 2, 3, ..., t − 2. We must now show that a
cut point cannot begin prior to position t − 1 or prior to position t. Recall that
by construction, w is followed by either another codeword, the letter 1, or the
empty string. If it is followed by the empty string, it is not possible to have a cut
point in the position prior to t − 1 or t since a string will be selected that has
length less than 3 and will therefore be a prefix of a forbidden string. If w is
followed by the letter 1, a cut point prior to t − 1 or t will have a prefix in the set
{101, 01},
both of which are forbidden strings. Finally consider the case that w is followed 
by another codeword and recall that all codewords begin with 001. If a cut point 
occurs prior to position t — 1, and the selected string beginning at that position 
has length at least five, then it will contain 10001 as a prefix which is a forbidden 
string; any shorter selection will be a prefix of the forbidden string 10001. If a 
cut point occurs prior to position t and the selected string has length at least 
four, it will contain 0001 as a prefix which is a forbidden string; any shorter 
selection will be a prefix of the forbidden string 0001. □ 

The above lemma ensures no codeword is divisible. The result is that the
binary encoded instance I' of an unbounded alphabet instance I can be
partitioned exactly in the same relative positions as the original instance. Since each
codeword cannot be a prefix of another by design, correctness of the binary
case immediately follows.

Theorem 13. Prefix (Suffix)-Free Multiple String Partition (PF-MSP) is
NP-complete for binary alphabet (|Σ| = 2).

B.4 Prefix/Suffix-Free String Partition with Binary Alphabet

Similar to the factor-free case, we will design delimiters to join the set W of
the multiple string case into one string, without changing the possibilities for
partitioning the original set of strings and without introducing new types of
collisions. Specifically, we will create a new string instance I = WF, where W
is a string that concatenates all strings in the set W, except those from the
forbidden set F, using delimiters we describe below. The string F has a special
construction to ensure the strings from the forbidden set must be selected in F
for any collision-free partition of I. Thus, the new instance I will have a
collision-free partition if and only if W does.



Construction of delimiters: We design delimiters similar to the codewords of
Section B.3 that instead have length K. Specifically, they are of the form
00(1)^i 0(1)^{K−4−i} 0, where i ∈ {2, ..., K − 6}.

Lemma 15. Given that no strings selected in W have a forbidden prefix, any 
prefix-free K -partition of W must select all delimiters. Furthermore, the delim- 
iters do not collide with any valid selection of the original strings from the set 
W. 

Proof. By Lemma 14 the delimiters, which are simply longer codewords, cannot 
contain a cut point in a middle position. Since they are of the maximum string 
length, they must be entirely selected in W. As a consequence of Lemma 14, 
in any valid partition of the set W, the selected strings are either: (i) single 
codewords, (ii) a codeword prefixed by a 1, (iii) a codeword suffixed by a 1, or 
(iv) two adjacent codewords. Since each case is no longer than a delimiter, it is 
sufficient to show that none could be a prefix of a delimiter. In all four cases,
none could be a prefix as the selected string would contain at least four 0s in
the first K/2 + 1 positions, whereas a delimiter contains at most three. □

Construction of forbidden string: We now construct the string F = F4F3F2F1. 
The string is constructed in a meticulous manner to ensure that each sub- 
part must select forbidden strings from a different partition of the forbidden 
set In particular, Fi = 10^^~^111 and it forces 11 to be selected; F2 = 
OOlO^-^O^-^lllOlOOl^-^Ol and it forces 101 and 01 to be selected; F3 = 
001^-^010001 and it forces 10001 to be selected; and F4 = 001^-^010001 and 
it forces 0001 to be selected. 

Lemma 16. In any prefix-free partition of F the set of forbidden strings must 
be selected. Furthermore, there exists a prefix-free partition of F that does not 
collide with a valid partition of W assuming that W does not contain a forbidden 
prefix. 

Proof (Sketch). At a high level, each sub-part of F is constructed to ensure that 
one or more forbidden strings are selected, and the remainder of the sub-part 
consists of K length sub- strings that must be entirely selected. Furthermore, the 
construction of F ensures there is always a cut point between sub-parts and sub- 
part Fj is constructed with the knowledge of strings forbidden in F^, for all i < j. 
The proof requires a detailed and exhaustive case analysis. We give a sketch 
of the correctness. Consider sub-part F_1. It contains the 3K-length substring
α = 1 0^{3K−2} 1. At least three strings must be selected to cover α. However, it
cannot be more than three as otherwise at least two contain only the letter 0 and
therefore one must be a prefix of another. The only cover for α consisting of
three strings must select 1 0^{K−1}, 0^K, and 0^{K−1} 1, regardless of the substrings
preceding or succeeding α. Since 1 cannot be selected alone (without being a
prefix of 1 0^{K−1}) the string 11 must be selected. Similar arguments, and the fact
that 11 must already be selected, ensures that F2 is partitioned as 0010^-^, 
0^-^11, 101, 001^-^ 01, thus forbidding 101 and 01. For F3, since 101 and 11 



are already forbidden, there cannot be a cut point prior to any 1 within the left- 
most run of Is. Since 01 is forbidden, there cannot be a cut point in the second 
position, nor immediately after the left-most run of Is. It follows that 001^~^ 
must be entirely selected, regardless of the string preceding Fs. The remaining 
string 10001 must be entirely selected, otherwise it would conflict with 101 or 
01. Similarly, the partitioning of F4 ensures that 001^~^01 and 0001 must be 
selected. 

Note that all strings selected in F not in the forbidden set have length K.
Thus, to show no new collisions have been introduced, it is sufficient to show that
no string selected in a valid partition of W, that does not contain a forbidden
prefix, can be a prefix of one of these strings. As noted earlier, every selected
string in a valid partition of W contains at least one codeword as a substring.
Each codeword contains three runs of 0s; however, each K-length selected string
in F contains no more than two runs of 0s. □

By our straightforward polynomial time and space reduction of a binary
PF-MSP instance into a binary PF-SP instance and by Lemmas 15 and 16 and
Theorem 13 we have the following result.

Theorem 14. Prefix (Suffix)-Free String Partition (PF-SP) is NP-complete
for binary alphabet (|Σ| = 2).



